Abstract


In this paper, we present algorithms and lower bounds for the Longest Increasing Subse- 
quence (LIS) and Longest Common Subsequence (LCS) problems in the data streaming model. 

For the problem of deciding whether the LIS of a given stream of integers drawn from
$\{1,\ldots,m\}$ has length at least $k$, we discuss a one-pass streaming algorithm using $O(k \log m)$
space, with update time either $O(\log k)$ or $O(\log\log m)$. For the problem of returning the actual
longest increasing subsequence itself, we give a $\lceil \log(1 + 1/\varepsilon) \rceil$-pass streaming algorithm with
update time $O(\log k)$ or $O(\log\log m)$ that uses space $O(k^{1+\varepsilon} \log m)$, for any $\varepsilon > 0$. We also
prove a lower bound of $\Omega(k)$ on the space required for any streaming algorithm for LIS, even
when the input stream is a permutation of $\{1,\ldots,m\}$.

We discuss a simple LIS-based algorithm for LCS, and we also give several lower bounds
on this problem, of which the strongest is the following: when the elements of two $n$-element
streams are presented in an adversarial order, we need space $\Omega(n/\rho^2)$ to approximate the length
of their LCS to within a factor of $\rho$, even when the two streams are permutations of each other.


1 Introduction 


Longest increasing and common subsequences. Let $S = x_1, x_2, \ldots, x_n$ be a sequence of
$n$ integers. A subsequence of $S$ is a sequence $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$ with $i_1 < i_2 < \cdots < i_k$. Such a
subsequence is said to be increasing if $x_{i_1} < x_{i_2} < \cdots < x_{i_k}$. In this paper, we consider two
fundamental problems related to subsequences:


• LONGEST INCREASING SUBSEQUENCE (LIS). Given a sequence $S$, find a maximum-length
increasing subsequence of $S$ (or find the length of such a subsequence).

• LONGEST COMMON SUBSEQUENCE (LCS). Given two sequences $S$ and $T$, find a maximum-length
sequence $x$ which is a subsequence of both $S$ and $T$ (or find the length of $x$).


Both LIS and LCS are fundamental combinatorial questions which have been well-studied in the 
computer science community [4, 6, 11, 16, 17, 22, among many others]. 
Among a large number of important applications of both of these problems, we highlight a few
that arise in computational biology. The BLAST (Basic Local Alignment Search Tool) [3] database 
supports queries of the following form: for a sequence $\sigma$ of amino acids, for example, what segments
of known proteins have high local similarity to $\sigma$? Zhang [25] has proposed filtering the results
of a BLAST query with an approach that uses an LIS algorithm as a black box to assemble the 
BLAST information about local similarity into a coherent picture of global similarity. An LIS step 
is also part of the MUMmer system for aligning entire genomes [8], and a straightforward LCS 
computation gives the value of the optimal alignment of two sequences of DNA [21]. 


The data streaming model. In the past few years, as we have witnessed the proliferation of 
truly massive data sets as diverse as fully sequenced genomes and the World Wide Web, traditional 
notions of efficiency have begun to appear inadequate. A polynomial-time algorithm—what is 
normally seen as the theoretical holy grail for a problem—may simply not be fast enough when run 
on an input like the multi-billion base pairs of the human genome. 

The theoretical computer science community has thus begun to explore new models of compu- 
tation, with new notions of efficiency, that more realistically capture when an algorithm is “fast 
enough.” The data streaming model [15] is one such well-studied model. In this model, an algo- 
rithm must make a small number of passes over the input data, processing each input element as 
it passes. Once the algorithm has seen an element, it is gone forever; thus we must compute and 
store a small amount of useful information about the previously read input. We are interested in 
algorithms that use a sublinear amount of additional space. (With a linear amount of space, a 
streaming algorithm can simply store the entire input and then run a traditional algorithm.) We 
typically aim for a polylogarithmic amount of space and a polylogarithmic amount of processing 
time for each element of the input. Ideal data streaming algorithms make only a single pass over 
the data, but we are also interested in multipass streaming algorithms, in which the algorithm can 
make a small number (typically constant) of passes over the input data. 


Our results: LIS and LCS in the data streaming model. In this paper, we study the 
difficulty of finding longest increasing subsequences and longest common subsequences in the data 
streaming model. We are motivated in our exploration by the fact that LIS and LCS are both 
fundamental combinatorial questions; we believe that a solid characterization of the tractability of 
basic questions like LIS and LCS will lead to a greater understanding of the power and limitations 
of the data streaming model. 

One notable obstacle that we face in the LIS problem is that, unlike many problems that have 
been previously considered in the streaming model, the LIS of a stream is an essentially global 
order-based property. Many of the problems that have been considered in the streaming model— 
for example, finding the most frequently occurring items in a stream [7, 9], clustering streaming 
data [14], or finding order statistics for a given stream [2, 19]—are entirely independent of the order 
of the elements presented in S; permuting the order of the items in the stream does not affect the 
correct answers to these questions. The problem of counting inversions in a stream [1] (i.e., the
number of pairs of indices $(i, j)$ such that $i < j$ but $x_i > x_j$) is an inherently order-based problem,
but much more local than that of LIS in the sense that an inversion is a relation between exactly
two items in the stream, whereas an increasing subsequence of length $\ell$ is a relation among $\ell$ items.

In this sense, the LIS problem is more closely aligned to estimating the histogram of the 
stream [12, 13]. However, the solution to the LIS may be incredibly sensitive to small changes 
in the data. For instance, consider an LIS that consists primarily of the same repeated value. If 
we change the data stream so that many occurrences of this value are slightly smaller, it radically
changes the LIS. Similar notions apply to LCS as well. While this does not preclude efficient
streaming algorithms for LIS or LCS, it does suggest some of the difficulties. 

In this paper, we first present positive results on (1) computing the length of the LIS of a
given input stream, and (2) outputting a maximum-length increasing sequence. We give a one-pass
streaming algorithm that uses $O(k \log m)$ space to compute the length of the longest increasing
subsequence for a given input stream, where $m \ge \max_i x_i$ is an upper bound on the largest element
in the stream, and $k$ is the length of the LIS. (This algorithm was also discovered independently
by Fredman [11] and again by Bespamyatnikh and Segal [6], though not in the context of the
data streaming model.) Our algorithm maintains values $A[1 \ldots k']$, where $A[i] \in \{1,\ldots,m\}$ is the
smallest possible last element of all increasing subsequences of length $i$ in the part of the stream
that has already been read, and $k'$ is the length of the LIS for the stream so far. As we read each
element, we can update the array $A$ in time $O(\log k)$. This algorithm can also be implemented using
van Emde Boas queues or y-fast trees to achieve an update time of $O(\log\log m)$ [23, 24]. For the
problem of returning the length-$k$ LIS of a given stream, we give a one-pass streaming algorithm
that uses $O(k^2 \log m)$ space. In the context of multipass streaming algorithms, we reduce the space
requirement to $O(k^{1+\varepsilon} \log m)$ by using $\lceil \log(1 + 1/\varepsilon) \rceil$ passes over the data. This is nearly optimal,
since simply storing the LIS itself requires $\Omega(k)$ space.

We also present lower bounds on the LIS problem in the streaming model. In the comparison 
model, Fredman [11] has proven that $n \log n - n \log\log n + O(n)$ comparisons are necessary and
sufficient to compute the LIS of an n-integer sequence, via a reduction from sorting. To the best 
of our knowledge, however, no lower bounds on LIS in the streaming model have been shown 
previously. As with many lower bounds on problems in the streaming model, our results are 
based upon the well-observed connection between the space required by a streaming algorithm and 
communication complexity. Specifically, a space-efficient streaming algorithm A to solve a problem 
gives rise to a solution to the corresponding two-party problem with low communication complexity; 
one party runs A on the first part of the input, transmits the small state of the algorithm to the 
other party, who then continues to run A on the remainder of the input. We prove a lower bound 
of $\Omega(k)$ for computing the LIS of a stream whenever $n = \Omega(k^2)$, by giving a reduction from the
SET-DISJOINTNESS problem, which is known to have high communication complexity. 

For the LCS problem, we discuss a simple LIS-based algorithm requiring O(n log m) space to 
compute the LCS of two n-element sequences presented as streams. If we want to compute the LCS 
of one n-element reference sequence against any number of test sequences, we can achieve the same 
space bound, independent of the number of test sequences. Our main results on LCS, however, are 
lower bounds. We prove that, if the two streams are general sequences, then we need $\Omega(n)$ space to
$\rho$-approximate the LCS of two streams of length $\Omega(n)$ to within any factor $\rho$. If the given streams
are $n$-element permutations, we prove that we need $\Omega(n/\rho^2)$ space to $\rho$-approximate the LCS.


2 Algorithms for Longest Increasing Subsequence 


We begin by presenting positive results on the LIS problem, both for computing the length of an 
LIS, and for actually producing an LIS itself. We use a dynamic-programming style algorithm, 
maintaining the last element of the “best” increasing subsequence of length $i$ seen so far, for each
i less than or equal to the length of the LIS seen so far. 

The algorithm presented here to calculate the length of the LIS was also discovered indepen- 
dently by Fredman [11] and by Bespamyatnikh and Segal [6] in a context other than the data 
streaming model; we include the algorithm here because our multipass algorithm to produce the 
LIS is an extension of it. 


compute-LIS(X)

    A[0] := −1
    A[1] := ∞
    k' := 0
    WHILE there are elements left in the stream X
        Read in the next element x_i from X
        Find ℓ such that A[ℓ] < x_i ≤ A[ℓ+1].
        Set A[ℓ+1] := x_i
        IF ℓ+1 > k'
            Set k' := k' + 1
            Set A[k'+1] := ∞
    Output k'


Figure 1: Pseudocode to compute the length of the LIS in a given stream X. 
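
The pseudocode of Figure 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `lis_length` and the list `tails` are introduced here, with `tails[l-1]` playing the role of A[l], and the sentinel entries A[0] and A[k'+1] are replaced by the boundary behaviour of the standard-library `bisect_left`.

```python
from bisect import bisect_left

def lis_length(stream):
    """One-pass length of the longest strictly increasing subsequence."""
    tails = []  # tails[l-1]: smallest last element over increasing subsequences of length l
    for x in stream:
        # Number of stored tails strictly smaller than x, i.e. the index l
        # with A[l] < x <= A[l+1] in the notation of Figure 1.
        l = bisect_left(tails, x)
        if l == len(tails):
            tails.append(x)   # x extends the longest subsequence seen so far
        else:
            tails[l] = x      # x is a smaller last element for length l+1
    return len(tails)

print(lis_length([3, 1, 4, 1, 5, 9, 2, 6]))
```

Only the `tails` list is kept between elements, which corresponds to the k stored values in the space bound discussed below; the binary search gives the O(log k) update time.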


2.1 Computing the Length of an LIS 


Let $S = x_1, x_2, \ldots, x_j, \ldots$ be a stream of data, and consider a length-$\ell$ increasing subsequence
$\sigma = x_{i_1}, x_{i_2}, \ldots, x_{i_\ell}$ of $S$. Write last$(\sigma) := x_{i_\ell}$. Let $\sigma_i$ denote the $i$th element in a subsequence
$\sigma$. For instance, last$(\sigma) = \sigma_{|\sigma|}$. We say that $\sigma$ is $(\ell, j)$-minimal if last$(\sigma)$ is minimized over all
length-$\ell$ increasing subsequences of the substream $x_1, x_2, \ldots, x_j$. We will say that such a $\sigma$ is an
$(\ell, j)$-minimal increasing sequence, or simply an $(\ell, j)$-MIS.

Our algorithm for computing the length of the LIS is based on maintaining $(\ell, j)$-minimal
subsequences for all $\ell \in \{1,\ldots,k'\}$ as we scan the stream, where $k'$ is the length of the longest
subsequence in the stream so far. Specifically, the streaming algorithm works as follows: we maintain
an array $A[1 \ldots k']$, where, after we have scanned the first $j$ elements of the stream, $A[\ell]$ will
store last$(\sigma)$ for an $(\ell, j)$-MIS $\sigma$. The algorithm updates each $A[\ell]$ as new elements from the stream
arrive, and increases $k'$ as appropriate. See Figure 1 for the pseudocode.


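Lemma 2.1 After $i$ iterations of the while loop in compute-LIS(), we have

\[
A[\ell] = \begin{cases}
\text{last}(\rho) \text{ for } \rho \text{ an } (\ell, i)\text{-MIS} & \text{if } \ell \le \text{LIS}(x_1, \ldots, x_i) \\
\infty \text{ or uninitialized} & \text{otherwise.}
\end{cases}
\]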


Proof. We proceed by induction on $i$, after strengthening the stated property by adding the following
to the induction hypothesis:


($\ast$) $A[j] < A[j']$ for all $j < j'$ such that $A[j], A[j']$ are initialized.


For $i = 0$, the property is vacuously true. For the inductive case, assume the desired properties
were maintained after we read in the element $x_{i-1}$ from the stream. Now consider the moment at
which we read the next element $x_i$ from the stream. Let $\ell$ be such that $A[\ell] < x_i \le A[\ell+1]$, as
in the algorithm. It is clear that only subsequences of length $\ell+1$ or higher might have a new
smallest last element. That is, $x_i$ is only going to affect values in $A$ with indices $\ell+1$ or higher.

On the other hand, note that $x_i$ can only extend a previous increasing subsequence $\sigma$ if $\sigma$ ends
with some element $\sigma_{|\sigma|} < x_i$. For all such subsequences, $\sigma_{|\sigma|} < x_i \le A[\ell+1]$. Hence by the
induction hypothesis, $\sigma$ is of length $\ell$ or shorter. This implies that the sequence $\sigma' = \sigma, x_i$ is of
length at most $\ell+1$. Thus $x_i$ can only affect values in $A$ with indices $\ell+1$ or lower.


Indeed, we now have a new subsequence $\sigma'$ of length $\ell+1$ with $x_i$ as the last element. (We can
extend the subsequence of length $\ell$ with last element $A[\ell]$.) So it is necessary and sufficient that
we update $A[\ell+1]$. It is also clear that the new $A[j]$'s respect the ordering constraint.


Theorem 2.2 We can decide whether the LIS of a given stream of integers from $\{1,\ldots,m\}$ has
length at least a given number $k$, or compute the length $k$ of the LIS of the given stream, with a one-pass
streaming algorithm that uses $O(k \log m)$ space and has update time $O(\log k)$ or $O(\log\log m)$.


Proof. By Lemma 2.1, the length is correctly computed by compute-LIS. Clearly, the decision problem can
also be solved with a minor change to the output of this algorithm.

For the space bound, observe that we keep $k$ values in the range $\{1,\ldots,m\}$, i.e., $O(\log m)$ bits
each. The only non-constant step in the update operation is to find the $\ell$ such that $A[\ell] < x_i \le A[\ell+1]$.
This can be done in $O(\log k)$ time by binary search; alternatively, we can use a van Emde
Boas queue [23] or y-fast trees [24] to support updates in $O(\log\log m)$ time.


2.2 Finding an LIS 


The algorithm described in the previous section only computes the length of the LIS, but does 
not find such a sequence. We now present a multipass streaming algorithm that actually finds a 
longest increasing subsequence. Specifically, our algorithm finds the length-$k$ LIS of a stream using
$O(k^{1+\varepsilon} \log m)$ space in $\lceil \log(1 + 1/\varepsilon) \rceil$ passes over the data. We first explain the one-pass version
of the algorithm, and then subsequently generalize it to multiple passes.


A one-pass algorithm. Consider an iteration in the decision algorithm in which we update,
say, $A[\ell+1]$ to $x_i$. In other words, we have $A[\ell] < x_i \le A[\ell+1]$. Then at this point, there
is an increasing subsequence $\sigma$ of length $\ell+1$ whose last two elements are $A[\ell]$ and $x_i$, since $x_i$
appears later in the stream than $A[\ell]$. Unfortunately, at some future time the value $A[\ell]$ may also
be updated, and thus the old value is lost. (Thus, since the new $A[\ell]$ is later in the stream than $x_i$
was, we can no longer reconstruct the last two elements of $\sigma$.)

The straightforward fix for this difficulty is, for each $\ell$, to store the subsequence $\sigma^\ell$ of length
$\ell$ that ends with $A[\ell]$. Thus, the algorithm maintains $k$ sequences $\sigma^1, \ldots, \sigma^k$, taking a total of
$O(k^2 \log m)$ space. When we update $A[\ell+1] := x_i$, we reset $\sigma^{\ell+1} := \sigma^\ell, x_i$. This adds only
a constant amount of extra running time per update, so the update time per element remains
$O(\log k)$ or $O(\log\log m)$, and the space requirement is $O(k^2 \log m)$.
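
As an illustration of the one-pass idea above, the following Python sketch stores one explicit subsequence per length. The names `lis_one_pass`, `tails` and `seqs` are introduced here and are not from the paper. Note one simplifying assumption: each update copies a Python list, which costs time proportional to its length, so this sketch illustrates the O(k^2 log m) space usage but not the stated update time.

```python
from bisect import bisect_left

def lis_one_pass(stream):
    """Return one longest strictly increasing subsequence of the stream."""
    tails = []  # tails[l-1] corresponds to A[l]
    seqs = []   # seqs[l-1] is a stored length-l increasing subsequence ending in tails[l-1]
    for x in stream:
        l = bisect_left(tails, x)          # A[l] < x <= A[l+1]
        prefix = seqs[l - 1] if l > 0 else []
        new_seq = prefix + [x]             # the stored length-l sequence followed by x
        if l == len(tails):
            tails.append(x)
            seqs.append(new_seq)
        else:
            tails[l] = x
            seqs[l] = new_seq
    return seqs[-1] if seqs else []

print(lis_one_pass([3, 1, 4, 1, 5, 9, 2, 6]))
```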


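A two-pass algorithm. We now describe a two-pass algorithm that requires less space. The key
modification is that during the first pass over the data, the algorithm only remembers part of each
$\sigma^\ell$, specifically every $q$th element (for a value of $q$ to be specified below). For each $\ell$, we maintain

\[
\tilde{\sigma}^{\ell} = \sigma^{\ell}_{1},\ \sigma^{\ell}_{q+1},\ \sigma^{\ell}_{2q+1},\ \ldots,\ \sigma^{\ell}_{\lfloor (\ell-2)/q \rfloor q + 1},\ \sigma^{\ell}_{\ell},
\]

where $\sigma^\ell_\ell = A[\ell]$, as before. The update rule for the first pass of the algorithm is then

\[
\tilde{\sigma}^{\ell+1} = \begin{cases}
\tilde{\sigma}^{\ell},\ x_i & \text{if } \ell \equiv 1 \pmod{q} \\
\text{all-but-last}(\tilde{\sigma}^{\ell}),\ x_i & \text{otherwise,}
\end{cases}
\]

where all-but-last$(\tilde{\sigma}^\ell)$ denotes the sequence $\tilde{\sigma}^\ell$ with the last element of the sequence, $A[\ell] = \sigma^\ell_\ell$,
omitted. Note that the space required for this entire pass is $O(k^2 \log m / q)$ when the length of the
LIS is $k$.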


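After the first pass is complete, we discard the subsequences $\tilde{\sigma}^1, \ldots, \tilde{\sigma}^{k-1}$, freeing a large amount
of space. Thus the only information we retain is the subsequence

\[
\tilde{\sigma}^{k} = \sigma^{k}_{1},\ \sigma^{k}_{q+1},\ \sigma^{k}_{2q+1},\ \ldots,\ \sigma^{k}_{\lfloor (k-2)/q \rfloor q + 1},\ \sigma^{k}_{k},
\]

where $\sigma^k$ is a length-$k$ LIS of the input. Write $\tilde{\sigma}^k = z[1], z[2], \ldots, z[\lfloor (k-2)/q \rfloor], \sigma^k_k$.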

In the second pass, we want to “fill in the blanks” of the subsequence $\tilde{\sigma}^k$ to produce $\sigma^k$.
Specifically, we want to find an increasing subsequence $\tau^\ell$ that starts with $z[\ell]$ and ends with $z[\ell+1]$
for each $\ell$. Notice that we can do this sequentially (for one $\ell$ at a time), since two consecutive $\tau$
subsequences do not overlap except at the endpoints. Thus each desired subsequence has length
exactly $q + 1$, and the total space required for the entire second pass is $O(q^2 \log m + k \log m)$.

Overall, the total space required by our algorithm is $O(\max(k^2 \log m / q,\ q^2 \log m) + k \log m)$.
This is minimized at $q = k^{2/3}$, giving us a space bound of $O(k^{1+1/3} \log m)$ for two passes.


Generalizing to a $p$-pass algorithm. We can generalize this idea to a larger number of passes
by computing the $\tau^\ell$ subsequences recursively. As before, in the first pass the algorithm remembers
only every $q$th element in the subsequence, and discards all stored subsequences except $\tilde{\sigma}^k$. Then
the algorithm uses $p - 1$ passes to find the roughly $k/q$ subsequences $\tau^1, \tau^2, \ldots, \tau^{\lfloor (k-2)/q \rfloor}$, where
each $\tau^\ell$ has length $q$.

Let $S(k, p)$ denote the space required by a $p$-pass algorithm to find a subsequence of length
$k$. We then have the following recurrence: $S(k, p) = \max(O(k^2 \log m / q),\ S(q, p-1)) + O(k \log m)$.
Solving the recurrence, we find that the space requirements are optimized at $q = k^{1 - 1/(2^p - 1)}$, and
where $S(k, p) = O(k^{1 + 1/(2^p - 1)} \log m)$.
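
As a check on this closed form, the following worked computation (added here, not part of the original derivation) verifies that the stated choice of $q$ balances the two terms inside the max. It suppresses constants and the $\log m$ factor, writes $a = 2^{p-1} - 1$ so that $2^p - 1 = 2a + 1$, and assumes inductively that $S(q, p-1)$ has the claimed form for $p \ge 2$:

\[
\frac{k^2}{q} = k^{2 - \left(1 - \frac{1}{2a+1}\right)} = k^{1 + \frac{1}{2a+1}},
\qquad
S(q, p-1) = q^{1 + \frac{1}{a}} = k^{\frac{2a}{2a+1} \cdot \frac{a+1}{a}} = k^{\frac{2a+2}{2a+1}} = k^{1 + \frac{1}{2a+1}}.
\]

Both terms equal $k^{1 + 1/(2^p - 1)}$, which dominates the additive $k$ term. For $p = 2$ this gives $q = k^{2/3}$ and exponent $1 + 1/3$, matching the two-pass bound, and for $p = 1$ the exponent $1 + 1/(2^1 - 1) = 2$ matches the one-pass bound.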


Theorem 2.3 Fix any $\varepsilon > 0$. For a given $k$, we can find a length-$k$ increasing subsequence of
a given stream of integers from $\{1,\ldots,m\}$ with a $\lceil \log(1 + 1/\varepsilon) \rceil$-pass streaming algorithm that
uses $O(k^{1+\varepsilon} \log m)$ space and has update time $O(\log k)$ or $O(\log\log m)$. We can find the longest
increasing subsequence of a stream even when its length $k$ is not known in advance, using the same
number of passes, the same update time, and space $O(\frac{1}{\varepsilon} k^{1+\varepsilon} \log m)$.


Proof. Given $\varepsilon > 0$, we choose $p = \lceil \log(1 + 1/\varepsilon) \rceil$. Then the $p$-pass algorithm described above uses
space $O(k^{1+\varepsilon} \log m)$ to compute the LIS of the given stream.

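If $k$ is unknown, then we modify the algorithm described above slightly. Define a recursive
sequence by $q_0 = 1$ and $q_{i+1} = q_i + q_i^{1-\varepsilon}$ for all $i \ge 0$. Then for the first pass only, change the
update rule to the following:

\[
\tilde{\sigma}^{\ell+1} = \begin{cases}
\tilde{\sigma}^{\ell},\ x_i & \text{if } \ell = q_j \text{ for some } j \\
\text{all-but-last}(\tilde{\sigma}^{\ell}),\ x_i & \text{otherwise.}
\end{cases}
\]

So after the first pass, we have retained the sequence

\[
\tilde{\sigma}^{k} = \sigma^{k}_{q_0},\ \sigma^{k}_{q_1},\ \sigma^{k}_{q_2},\ \ldots,\ \sigma^{k}_{q_t},
\]

where $t$ is the largest index such that $q_t \le k$.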

By the recursion, the gap between adjacent indices $i, i+1 \le t$ for elements of $\sigma^k$ we have
retained is $q_{i+1} - q_i = q_i^{1-\varepsilon} \le k^{1-\varepsilon}$. In the standard algorithm where $k$ is known in advance, we
also have a gap of $k^{1-\varepsilon}$. So we can resume the standard algorithm from the second pass on, using
the same time and space requirements.


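Now, for the first pass of the algorithm, the update time is identical to the standard algorithm.
However, the space used is $O(k t \log m)$. We now bound $t$. To this end, define
$I_g = \{i : 2^g \le q_i < 2^{g+1}\}$. Let $i_g$ be the smallest index in $I_g$. Then for all $j \ge 0$, we see
$q_{i_g + j} \ge q_{i_g} + j\, q_{i_g}^{1-\varepsilon} \ge q_{i_g} + j\, 2^{g(1-\varepsilon)}$. Hence, $|I_g| \le (2^{g+1} - 2^g) / 2^{g(1-\varepsilon)} \le 2^{g\varepsilon}$.

Let $I = \{i : q_i \le k\}$. By definition, $t \le |I|$. From the above, we have

\[
|I| \le \sum_{g=0}^{\lg k} |I_g| \le \sum_{g=0}^{\lg k} 2^{g\varepsilon} = \frac{k^{\varepsilon} 2^{\varepsilon} - 1}{2^{\varepsilon} - 1} \le \frac{k^{\varepsilon} 2^{\varepsilon}}{\varepsilon \ln 2},
\]

where the last inequality follows since $e^x \ge 1 + x$ for all $x$. So the total space used in the first pass
is $O(\frac{1}{\varepsilon} k^{1+\varepsilon} \log m)$, as we wanted.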


3 Lower Bounds for LIS 


We now turn our attention to a lower bound on the space required for streaming algorithms solving 
the longest increasing subsequence problem. In this section, we prove that $\Omega(k)$ bits of storage are
required to decide if the LIS of a stream of $N$ elements has length at least $k$, for any $N = \Omega(k^2)$.
Our proof is based on a reduction from the set disjointness problem, which is known to have high 
communication complexity: 


Definition 3.1 (Set Disjointness) In the SET-DISJOINTNESS problem, there are two parties A
and B who wish to solve the following problem. Party A holds an $n$-bit string $s_A$, and Party B
holds another $n$-bit string $s_B$. They must decide whether there is at least one ‘1’ in the bitwise-and
$s_A \wedge s_B$ of $s_A$ and $s_B$ (i.e., decide if $s_A$ and $s_B$ both have a ‘1’ in at least one position) while
minimizing the number of bits communicated between the parties.


We will say that $s_A$ and $s_B$ intersect for a “yes” instance of SET-DISJOINTNESS.

Lower bounds for the set disjointness problem are of fundamental importance, and have been 
studied extensively (e.g., [5, 18, 20]). The most recent results show that even in the randomized 
setting, SET-DISJOINTNESS requires a large number of bits of communication: 


Proposition 3.2 ([5]) Let $\delta \in (0, 1/4)$. Any randomized protocol solving the SET-DISJOINTNESS
problem with probability at least $1 - \delta$ requires at least $\frac{n}{4}(1 - 2\sqrt{\delta})$ bits of communication, even when
$s_A$ and $s_B$ both contain exactly $n/4$ ones.


We now reduce SET-DISJOINTNESS to the problem of determining if an increasing subsequence
of length $\sqrt{N}$ exists in a stream of $N$ elements. This reduction shows that, even if we allow
randomization and some chance of error, deciding whether there is an increasing subsequence of
length $k$ requires $\Omega(k)$ space in the streaming model.

Suppose we are given an instance $(s_A, s_B)$ of the SET-DISJOINTNESS problem, where $n := |s_A| = |s_B|$.
We will construct a stream lis-stream$(s_A, s_B)$ whose longest increasing subsequence has length
$n + 1$ if and only if $(s_A, s_B)$ are non-disjoint. Further, the first half of lis-stream$(s_A, s_B)$ depends
only on $s_A$, while its second half depends only on $s_B$. With each index $i \in \{1,\ldots,n\}$, we associate
the sequence $(n+1)\cdot(i-1)+1, \ldots, (n+1)\cdot i$, divided into two parts: the first $i$ integers form the
sequence A-part$(i) = (n+1)\cdot(i-1)+1,\ (n+1)\cdot(i-1)+2,\ \ldots,\ (n+1)\cdot(i-1)+i$ and the remaining
$n-i+1$ integers form the sequence B-part$(i) = (n+1)\cdot(i-1)+i+1,\ (n+1)\cdot(i-1)+i+2,\ \ldots,\ (n+1)\cdot i$.


Let lis-stream-A$(s_A)$ be the sequence consisting of A-part$(i)$ for every $i \in \{i : s_A(i) = 1\}$,
in decreasing order of the index $i$. Similarly, let lis-stream-B$(s_B)$ be the sequence consisting of
B-part$(i)$ for every $i \in \{i : s_B(i) = 1\}$, also listed in decreasing order of the index $i$. Clearly,
lis-stream-A$(s_A)$ (and lis-stream-B$(s_B)$, respectively) only depends on $s_A$ ($s_B$, respectively). Then
we define the stream lis-stream$(s_A, s_B)$ to be lis-stream-A$(s_A)$ followed by lis-stream-B$(s_B)$.

As an example (which we will return to throughout the paper), consider the 9-bit vectors
$ex_A = [0,1,0,1,1,0,0,0,0]$ and $ex_B = [1,0,0,1,0,0,1,0,0]$. Then $n = 9$ and


lis-stream$(ex_A, ex_B)$ = 41, 42, 43, 44, 45, 31, 32, 33, 34, 11, 12,
                        68, 69, 70, 35, 36, 37, 38, 39, 40, 2, 3, 4, 5, 6, 7, 8, 9, 10.

Observe the increasing subsequence 31, 32, ..., 40 of length $n + 1 = 10$ in this stream.


Lemma 3.3 The vectors $s_A$ and $s_B$ intersect if and only if LIS(lis-stream$(s_A, s_B)$) has length $n+1$.


Proof. We prove the obvious direction first. If $s_A(i) = s_B(i) = 1$ for some particular $i$, then observe
that lis-stream$(s_A, s_B)$ contains the increasing subsequence A-part$(i)$ B-part$(i)$ $= (n+1)\cdot(i-1)+1,\ (n+1)\cdot(i-1)+2,\ \ldots,\ (n+1)\cdot i$,
which contains $n + 1$ increasing integers.

For the converse direction, we prove its contrapositive form. Suppose $s_A$ and $s_B$ do not intersect.
Observe that whenever $i < j$ we have that (1) A-part$(i)$ follows A-part$(j)$ in lis-stream-A$(s_A)$, and
(2) the integers in A-part$(i)$ are all smaller than those in A-part$(j)$. Thus any increasing subsequence
within lis-stream-A$(s_A)$ (or lis-stream-B$(s_B)$, similarly) has length at most $n$, and can contain
integers from A-part$(i)$ for only a single $i$. Thus the only potential increasing subsequences
of length $n + 1$ must be subsequences of A-part$(i)$ B-part$(j)$ for some indices $i$ and $j$ so that
$s_A(i) = s_B(j) = 1$. (By assumption, then, we must have $i \ne j$.) Furthermore, unless $i < j$,
all the integers in A-part$(i)$ are larger than the integers in B-part$(j)$. Thus the longest increasing
subsequence in lis-stream$(s_A, s_B)$ is of length at most $|$A-part$(i)| + |$B-part$(j)| = i + n - j + 1 \le n$.
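
The construction of lis-stream and the statement of Lemma 3.3 can be checked on the running example with a short Python sketch. The function names `a_part`, `b_part`, `lis_stream` and `lis_length` are introduced here for illustration; bit vectors are assumed to be Python lists indexed from 0, so position i of the paper is `s[i - 1]`.

```python
from bisect import bisect_left

def a_part(i, n):
    """First i integers of the block (n+1)(i-1)+1, ..., (n+1)i."""
    base = (n + 1) * (i - 1)
    return list(range(base + 1, base + i + 1))

def b_part(i, n):
    """Remaining n-i+1 integers of the same block, ending at (n+1)i."""
    base = (n + 1) * (i - 1)
    return list(range(base + i + 1, base + n + 2))

def lis_stream(s_a, s_b):
    """A-parts of the ones of s_a, then B-parts of the ones of s_b, each by decreasing index."""
    n = len(s_a)
    stream = []
    for i in range(n, 0, -1):
        if s_a[i - 1] == 1:
            stream += a_part(i, n)
    for i in range(n, 0, -1):
        if s_b[i - 1] == 1:
            stream += b_part(i, n)
    return stream

def lis_length(stream):
    """Length of the longest strictly increasing subsequence."""
    tails = []
    for x in stream:
        l = bisect_left(tails, x)
        if l == len(tails):
            tails.append(x)
        else:
            tails[l] = x
    return len(tails)

ex_a = [0, 1, 0, 1, 1, 0, 0, 0, 0]
ex_b = [1, 0, 0, 1, 0, 0, 1, 0, 0]
print(lis_stream(ex_a, ex_b))
print(lis_length(lis_stream(ex_a, ex_b)))
```

Replacing `ex_b` by a vector disjoint from `ex_a`, for instance `[1, 0, 1, 0, 0, 1, 0, 0, 0]`, yields a stream whose LIS has length at most n = 9.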


We can improve the construction so that the resulting stream is a permutation,
i.e., a stream containing each of the numbers of $\{0, 1, \ldots, z\}$ exactly once. We will show
that a suitable $z = O(n^2)$ suffices. Our construction is an extension of the above. We modify
lis-stream-A and lis-stream-B as follows: we include the integers from A-part$(i)$ and B-part$(i)$ even
when $s_A(i) = 0$ or $s_B(i) = 0$, but so that only two of these elements can be part of an LIS:


• Let $U_A = \{x : x \in$ A-part$(i)$ for some $i$ such that $s_A(i) = 0\}$. Then we define pad-A$(s_A)$ to be
the sequence consisting of integers in $U_A$ listed in decreasing order, followed by 0. We define
lis-stream-perm-A$(s_A)$ to be pad-A$(s_A)$ followed by lis-stream-A$(s_A)$.


• Similarly, let $U_B = \{x : x \in$ B-part$(i)$ for some $i$ such that $s_B(i) = 0\}$. Then we define
pad-B$(s_B)$ to be the sequence consisting of $(n+1)\cdot n + 1$, followed by the integers in $U_B$
listed in decreasing order. We define lis-stream-perm-B$(s_B)$ to be lis-stream-B$(s_B)$ followed by
pad-B$(s_B)$.

Now define lis-stream-perm$(s_A, s_B)$ := lis-stream-perm-A$(s_A)$ lis-stream-perm-B$(s_B)$. This stream
consists of the “missing” elements of $s_A$ in decreasing order, followed by 0, then followed by the
“present” elements; then the “present” elements of $s_B$, followed by $(n+1)\cdot n + 1$, followed by the
“missing” elements of $s_B$. In our previous example, then,


lis-stream-perm$(ex_A, ex_B)$ = 89, ..., 81, 78, ..., 71, 67, ..., 61, 56, ..., 51, 23, ..., 21, 1, 0,
                             41, ..., 45, 31, ..., 34, 11, 12,
                             68, ..., 70, 35, ..., 40, 2, ..., 10,
                             91, 90, 80, 79, 60, ..., 57, 50, ..., 46, 30, ..., 24, 20, ..., 13.


One can easily verify that lis-stream-perm$(s_A, s_B)$ is a permutation of the set $\{0, \ldots, (n+1)\cdot n + 1\}$.


Lemma 3.4 The vectors $s_A$ and $s_B$ intersect if and only if LIS(lis-stream-perm$(s_A, s_B)$) has length
at least $n + 3$.


Proof. Observe that the prefix of lis-stream-perm$(s_A, s_B)$ ending with the element 0 is a decreasing
sequence, as is the suffix starting with the element $(n+1)\cdot n + 1$. Thus any increasing subsequence
of lis-stream-perm$(s_A, s_B)$ can contain at most one element from each of these segments. Thus the
following sequence must be a longest increasing subsequence of lis-stream-perm$(s_A, s_B)$: first 0, then
a longest increasing subsequence of lis-stream$(s_A, s_B)$, then $(n+1)\cdot n + 1$. By Lemma 3.3, then,
the length of the longest increasing subsequence of lis-stream-perm$(s_A, s_B)$ is $n + 3$ if and only if $s_A$
and $s_B$ intersect.
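
The padded construction can likewise be sketched in Python and checked against the running example. As before, the function names are introduced here for illustration only, and bit vectors are assumed to be 0-indexed Python lists. The sketch builds lis-stream-perm, confirms that it is a permutation of {0, ..., (n+1)n+1}, and computes its LIS length.

```python
from bisect import bisect_left

def a_part(i, n):
    base = (n + 1) * (i - 1)
    return list(range(base + 1, base + i + 1))

def b_part(i, n):
    base = (n + 1) * (i - 1)
    return list(range(base + i + 1, base + n + 2))

def lis_stream_perm(s_a, s_b):
    """pad-A, 0, lis-stream-A, lis-stream-B, (n+1)n+1, pad-B, in that order."""
    n = len(s_a)
    idx = range(1, n + 1)
    # "Missing" elements: parts belonging to zero bits, listed in decreasing order.
    u_a = sorted((x for i in idx if s_a[i - 1] == 0 for x in a_part(i, n)), reverse=True)
    u_b = sorted((x for i in idx if s_b[i - 1] == 0 for x in b_part(i, n)), reverse=True)
    # "Present" elements: parts belonging to one bits, by decreasing index.
    stream_a = [x for i in reversed(idx) if s_a[i - 1] == 1 for x in a_part(i, n)]
    stream_b = [x for i in reversed(idx) if s_b[i - 1] == 1 for x in b_part(i, n)]
    return u_a + [0] + stream_a + stream_b + [(n + 1) * n + 1] + u_b

def lis_length(stream):
    tails = []
    for x in stream:
        l = bisect_left(tails, x)
        if l == len(tails):
            tails.append(x)
        else:
            tails[l] = x
    return len(tails)

ex_a = [0, 1, 0, 1, 1, 0, 0, 0, 0]
ex_b = [1, 0, 0, 1, 0, 0, 1, 0, 0]
perm = lis_stream_perm(ex_a, ex_b)
print(sorted(perm) == list(range(92)))  # permutation of {0, ..., 91}
print(lis_length(perm))                 # n + 3 = 12 for this intersecting pair
```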


Theorem 3.5 For any length $k$ and for any $N \ge k\cdot(k-1)+2$, any streaming algorithm which
decides whether LIS$(S) \ge k$ for a stream $S$ which is a permutation of $\{1,\ldots,N\}$ with probability at
least 3/4 requires $\Omega(k)$ space.


Proof. Suppose that an algorithm $A(S)$ decides with probability at least 3/4 whether stream $S$,
where $|S| = N$, contains an increasing subsequence of length $k$. We show how to solve an instance
$(s_A, s_B)$ of the SET-DISJOINTNESS problem with $|s_A| = k - 1 = |s_B|$ with probability at least 3/4
by calling $A$. The stream we consider is

\[
S := \underbrace{N-1,\ N-2,\ \ldots,\ k\cdot(k-1)+2}_{\text{Extra Numbers}},\ \text{lis-stream-perm}(s_A, s_B).
\]


Note that, as in the proof of Lemma 3.4, the longest increasing subsequence of $S$ has exactly the
same length as the longest increasing subsequence of lis-stream-perm$(s_A, s_B)$ since the prepended
elements of $S$ are all larger than those in lis-stream-perm$(s_A, s_B)$, and are presented in descending
order. Thus, by Lemma 3.4, the LIS of $S$ has length $k$ (and $A(S)$ returns true with probability at
least 3/4) if and only if $s_A$ and $s_B$ intersect.

This immediately implies a lower bound on the space required by $A$ by Proposition 3.2: to
solve the instance $(s_A, s_B)$ of the SET-DISJOINTNESS problem, Party A simulates the algorithm $A$
on the stream Extra Numbers, lis-stream-perm-A$(s_A)$, then sends all stored information to Party B,
who continues simulating $A$ on the remainder of the stream $S$. By Proposition 3.2, then, Party A
must transmit at least $\Omega(k)$ bits in this protocol, and thus $A$ must use $\Omega(k)$ space.


4 Longest Common Subsequence 


In this section, we turn to the LCS problem. Recall that for LCS we are given two streams $S_1$ and
$S_2$, consisting of $n_1$ and $n_2$ integers, respectively, drawn from the set $\{1, 2, \ldots, m\}$. Throughout
this section, we consider the adversarial streaming model, in which elements from the two streams
can be presented in any order of interleaving. Specifically, in the lower bounds that we construct
in this section, the algorithm is given access to all of $S_1$ before having access to any of $S_2$.

First, as with all streaming problems, observe that there is a trivial streaming algorithm that
solves LCS using $O(n_1 \log m + n_2 \log m)$ space: we simply store both streams in their entirety, and
then run a standard (non-streaming) LCS algorithm on the stored sequences. We can give another
algorithmic upper bound for a version of LCS, based upon a simple connection between LIS and
LCS. Suppose that we are first given one reference sequence $R$ and then given a large number of test
sequences $S_1, S_2, \ldots, S_q$; we want to compute the LCS of $R$ and $S_i$ for all $1 \le i \le q$. Our streaming
algorithm stores the permutation $R$ as a lookup table, and then, for each $S_i$, runs the LIS algorithm
from Section 2, where we interpret two elements $x$ and $y$ to be ordered $x < y$ if $x$ appears before $y$
in $R$. If these are $n$-element sequences, then this algorithm requires $O(n \log m)$ total space:
$O(n \log m)$ to store $R$, and $O(k \log m) = O(n \log m)$ for the LIS computation. Note that this bound
is independent of $q$.
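
The LIS-based approach just described can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the reference sequence is assumed to contain no repeated symbols (it is a permutation), test symbols that do not occur in the reference are simply skipped, and the names `lcs_length_against_reference` and `position` are introduced here.

```python
from bisect import bisect_left

def lcs_length_against_reference(reference, test_stream):
    """LCS length of a repetition-free reference and one test stream, via LIS."""
    # Lookup table: symbol -> position in the reference sequence R.
    position = {sym: idx for idx, sym in enumerate(reference)}
    tails = []  # LIS state over positions in R, as in Section 2
    for sym in test_stream:
        if sym not in position:
            continue  # assumption: symbols absent from R cannot be matched
        p = position[sym]
        l = bisect_left(tails, p)
        if l == len(tails):
            tails.append(p)
        else:
            tails[l] = p
    return len(tails)

reference = [3, 1, 4, 5, 2]
print(lcs_length_against_reference(reference, [1, 3, 4, 2, 5]))
```

The lookup table is built once and can be reused for every test sequence, which reflects the independence from q noted above; only the `tails` list is reset between test sequences.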

In the remainder of this section, we present some lower bounds for LCS, again using the
SET-DISJOINTNESS problem. We first show an easy lower bound when $S_1$ and $S_2$ are not necessarily
permutations, and then show a more involved bound for exact or approximate computation of the 
LCS for permutations. 


4.1 Lower Bound on Exact and Approximate LCS for General Sequences


It is straightforward to see that if we allow the streams $S_1$ and $S_2$ not to be permutations of each
other, then the lower bound is trivial, even for approximation:


Theorem 4.1 For any length $N$ and any approximation ratio $\rho$, any streaming algorithm which
$\rho$-approximates the LCS of two streams $S_1, S_2$ (in adversarial order) each of length $N$ with probability
at least 3/4 requires $\Omega(N)$ space, even when the algorithm is presented all of $S_1$ followed by all of
$S_2$.


Proof. Let $S$ be a sequence consisting of sequence $S_1$ followed by sequence $S_2$, and suppose that an
algorithm $A(S)$ decides with probability at least 3/4 whether streams $S_1$ and $S_2$ contain a common
subsequence of length 1. We show how to solve an instance $(s_A, s_B)$ of SET-DISJOINTNESS with
$|s_A| = 4N = |s_B|$, where $s_A$ and $s_B$ both contain exactly $N$ ones, with probability at least 3/4 by
using $A$.

Let stream $S_1$ consist of all $i$ such that $s_A(i) = 1$, and let $S_2$ consist of all $i$ such that $s_B(i) = 1$.
Thus $S_1$ and $S_2$ have a common subsequence of length 1 if $s_A$ and $s_B$ have at least one
element in common and of length 0 otherwise. Thus, if $A(S)$ outputs the correct answer within any
approximation ratio, it must distinguish between the 0 case and the length 1 case. This implies the
desired lower bound, since we can solve the SET-DISJOINTNESS using $A$. The first party simulates
$A$ on stream $S_1$, then passes its state to the second party. The second party finishes simulating $A$
on the rest of $S$, namely on $S_2$. By Proposition 3.2, this state must therefore use $\Omega(N)$ space.

To show that we still require $\Omega(N)$ space when one or both of the streams has length strictly 
larger than $N$, we simply add arbitrary new elements to each of the above streams. 
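
The reduction in this proof can be checked on small instances. The following is a minimal sketch, not part of the lower-bound argument itself: it builds the two streams from a SET-DISJOINTNESS instance and compares the LCS length with whether the vectors intersect. The function names and the sample vectors are illustrative assumptions, and the LCS routine is the standard offline dynamic program rather than a streaming algorithm.

```python
def lcs_length(x, y):
    """Textbook O(|x|*|y|) dynamic program for LCS length (offline, not streaming)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def streams_from_instance(s_a, s_b):
    """S_1 lists the 1-based positions of the ones in s_a; S_2 does so for s_b."""
    s1 = [i for i, bit in enumerate(s_a, start=1) if bit == 1]
    s2 = [i for i, bit in enumerate(s_b, start=1) if bit == 1]
    return s1, s2


def vectors_intersect(s_a, s_b):
    """True if some position holds a one in both bit vectors."""
    return any(a == 1 and b == 1 for a, b in zip(s_a, s_b))


# Hypothetical instance with N = 2: vectors of length 4N = 8, two ones each.
s_a = [1, 0, 0, 1, 0, 0, 0, 0]
s_b = [0, 1, 0, 1, 0, 0, 0, 0]
s1, s2 = streams_from_instance(s_a, s_b)
print(s1, s2, lcs_length(s1, s2) >= 1, vectors_intersect(s_a, s_b))
```

On any instance, the LCS of the two streams is nonempty exactly when the vectors intersect, which is the property the proof relies on.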


Although the above construction is for multiplicative approximation, a simple variation also shows 
that any data streaming algorithm solving this problem within an additive error of $a$ takes space at least 
$\Omega(N/a)$; simply repeat each element in the streams $2a$ times. 


4.2 Lower Bound on Exact LCS for Permutations 


We now improve the construction to show a lower bound on the space required for an LCS algorithm 
even when the two streams $S_1$ and $S_2$ are both permutations of the set $\{1,\ldots,n\}$. 

Given an instance $(s_A, s_B)$ of the SET-DISJOINTNESS problem where there are exactly $n/4$ ones 
in each of $s_A$ and $s_B$, we construct two streams as follows: 


• lcs-perm-A($s_A$) consists of the sequence $R_A$ followed by the sequence $\bar{R}_A$, where $R_A$ contains 
$\{i : s_A(i) = 1\}$ in increasing order of $i$ and $\bar{R}_A$ contains $\{i : s_A(i) = 0\}$ in decreasing order. 

• lcs-perm-B($s_B$) consists of the sequence $R_B$ followed by the sequence $\bar{R}_B$, where $R_B$ contains 
$\{i : s_B(i) = 1\}$ in increasing order and $\bar{R}_B$ contains $\{i : s_B(i) = 0\}$ in decreasing order. 


Lemma 4.2 The vectors $s_A$ and $s_B$ intersect if and only if LCS(lcs-perm-A($s_A$), lcs-perm-B($s_B$)) 
has length at least $n/2 + 2$. 


Proof. If $s_A$ and $s_B$ intersect, then we can construct a common subsequence of lcs-perm-A($s_A$) and 
lcs-perm-B($s_B$) as follows. First choose the common element from $R_A$ and $R_B$. Since $s_A$ and $s_B$ 
intersect, the set $\{i : s_A(i) = s_B(i) = 0\}$ must contain at least $n/2 + 1$ elements, since there are 
exactly $n/4$ ones in each of $s_A$ and $s_B$. This implies a common subsequence of $\bar{R}_A$ and $\bar{R}_B$ of length 
$n/2 + 1$, and thus an overall common subsequence with total length $n/2 + 2$. 

On the other hand, if $s_A$ and $s_B$ have no common element, then none of the elements in $R_A$ can 
be matched up with $R_B$. Of course, some elements in $R_A$ might be matched with elements in $\bar{R}_B$ 
(or vice versa), but $R_A$ is in increasing order while $\bar{R}_B$ is in decreasing order, so at most one such 
element can be matched. Also, $\bar{R}_A$ and $\bar{R}_B$ have exactly $n/2$ common elements, so $\bar{R}_A$ can 
be matched with at most $n/2$ elements in $\bar{R}_B$. Thus LCS(lcs-perm-A($s_A$), lcs-perm-B($s_B$)) can have 
length at most $n/2 + 1$. 
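
As a sanity check on Lemma 4.2, the construction can be tested exhaustively for small $n$. The sketch below is an illustration under stated assumptions, not part of the proof: the helper names `lcs_perm` and `check_lemma` are hypothetical, the LCS is computed offline by the standard dynamic program, and $n$ is assumed to be divisible by 4.

```python
from itertools import combinations


def lcs_length(x, y):
    """Textbook dynamic program for LCS length (offline, not streaming)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_perm(s):
    """Positions of ones in increasing order, then positions of zeros in decreasing order."""
    n = len(s)
    ones = [i for i in range(1, n + 1) if s[i - 1] == 1]
    zeros = [i for i in range(n, 0, -1) if s[i - 1] == 0]
    return ones + zeros


def check_lemma(n):
    """Try every pair of n-bit vectors with exactly n/4 ones each."""
    k = n // 4
    for ones_a in combinations(range(n), k):
        s_a = [1 if i in ones_a else 0 for i in range(n)]
        for ones_b in combinations(range(n), k):
            s_b = [1 if i in ones_b else 0 for i in range(n)]
            length = lcs_length(lcs_perm(s_a), lcs_perm(s_b))
            if set(ones_a) & set(ones_b):
                if length < n // 2 + 2:
                    return False
            elif length > n // 2 + 1:
                return False
    return True


print(check_lemma(8))
```

Both streams are permutations of $\{1,\ldots,n\}$ by construction, so the check exercises exactly the setting of the lemma.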


Theorem 4.3 For any length $k$ and for any $N \geq 2k - 4$, any streaming algorithm which decides 
whether $\mathrm{LCS}(S_1, S_2) \geq k$ for streams $S_1, S_2$ which are permutations of $\{1,\ldots,N\}$ with probability 
at least 3/4 requires $\Omega(k)$ space. 


Proof. The theorem follows analogously to Theorem 4.1 when $N = 2k - 4$: deciding whether 
lcs-perm-A($s_A$) and lcs-perm-B($s_B$) have a common subsequence of length $N/2 + 2 = k$ requires 
$\Omega(k) = \Omega(N)$ space, by Lemma 4.2. 

For larger $N$, we pad the streams, as in Theorem 3.5. Add the decreasing sequence 
$N, N-1, N-2, \ldots, 2k-4+1$ to the beginning of lcs-perm-A($s_A$) and add the increasing sequence 
$2k-4+1, 2k-4+2, \ldots, N$ to the end of lcs-perm-B($s_B$). Then any common subsequence of these 
extended sequences is either (1) contained entirely in the unextended portions of lcs-perm-A($s_A$) 
and lcs-perm-B($s_B$), or (2) of length at most one. Then, as before, the LCS has length at least $k$ if and 
only if $s_A$ and $s_B$ intersect, and thus we require $\Omega(k)$ space to compute the LCS. 
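
The padding step can be sketched as follows. This is an illustrative check rather than the construction's definitive form: `pad_streams` and the sample permutations are assumptions, and the LCS is again computed offline. The padding goes in front of the first stream in decreasing order and behind the second stream in increasing order, so a common subsequence that uses a padding element cannot use anything else.

```python
def lcs_length(x, y):
    """Textbook dynamic program for LCS length (offline, not streaming)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def pad_streams(perm_a, perm_b, big_n):
    """Extend two permutations of {1..m} to permutations of {1..big_n}."""
    m = len(perm_a)
    if len(perm_b) != m or big_n < m:
        raise ValueError("streams must have equal length m with big_n >= m")
    extra = list(range(m + 1, big_n + 1))
    # Decreasing padding before stream A, increasing padding after stream B.
    return extra[::-1] + perm_a, perm_b + extra


# Hypothetical unpadded streams: permutations of {1..8} with an LCS of length 4.
a = [1, 2, 8, 7, 6, 5, 4, 3]
b = [3, 4, 8, 7, 6, 5, 2, 1]
padded_a, padded_b = pad_streams(a, b, 12)
print(padded_a, padded_b, lcs_length(a, b), lcs_length(padded_a, padded_b))
```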


4.3 Lower Bound on Approximating LCS for Permutations 


We now present lower bounds for the space required for approximation algorithms for LCS on 
permutations. Suppose that $p$ is the desired approximation ratio. For each $i$, we will construct 
sequences $p$-approx-A($i, s_A$) and $p$-approx-B($i, s_B$) so that the two sequences have a common 
subsequence of length $p^2$ if $s_A(i) = s_B(i) = 1$, and so that the longest common subsequence has length 
at most $p$ otherwise. For each $i \leq n$, both sequences are of length $p^2$, and consist of the integers 
$\{(i-1) \cdot p^2 + 1, (i-1) \cdot p^2 + 2, \ldots, (i-1) \cdot p^2 + p^2\}$. We define them as follows: 


• For $s_A(i) = 1$, define $p$-approx-A($i, s_A$) to be the increasing sequence $(i-1) \cdot p^2 + 1, 
(i-1) \cdot p^2 + 2, \ldots, (i-1) \cdot p^2 + p^2$. If $s_A(i) = 0$, then define $p$-approx-A($i, s_A$) to be the decreasing 
sequence $(i-1) \cdot p^2 + p^2, (i-1) \cdot p^2 + p^2 - 1, \ldots, (i-1) \cdot p^2 + 1$. 


• For $s_B(i) = 1$, define $p$-approx-B($i, s_B$) to be the increasing sequence $(i-1) \cdot p^2 + 1, 
(i-1) \cdot p^2 + 2, \ldots, (i-1) \cdot p^2 + p^2$. When $s_B(i) = 0$, we use a more complicated ordering of the $p^2$ 
numbers. Specifically, we use what we call the median sequence $\sigma$ of these $p^2$ numbers so that 
the longest increasing subsequence and the longest decreasing subsequence of $\sigma$ both have 
length exactly $p$. In this case, we define $p$-approx-B($i, s_B$) to be the sequence 


$$
\begin{array}{l}
(i-1) \cdot p^2 + p,\ (i-1) \cdot p^2 + p - 1,\ \ldots,\ (i-1) \cdot p^2 + 1, \\
(i-1) \cdot p^2 + 2p,\ (i-1) \cdot p^2 + 2p - 1,\ \ldots,\ (i-1) \cdot p^2 + p + 1, \\
\qquad \vdots \\
(i-1) \cdot p^2 + p^2,\ (i-1) \cdot p^2 + p^2 - 1,\ \ldots,\ (i-1) \cdot p^2 + (p-1) \cdot p + 1.
\end{array}
$$
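
The median sequence can be generated and checked mechanically. The sketch below assumes the row structure implied by the worked example later in this section (each of the $p$ rows is decreasing, and the rows appear in increasing order); the function names are illustrative, and the LIS length is computed with the standard patience-sorting method.

```python
from bisect import bisect_left


def median_sequence(i, p):
    """Block i of p*p consecutive integers: p decreasing rows, rows in increasing order."""
    base = (i - 1) * p * p
    seq = []
    for row in range(1, p + 1):
        # Row `row` runs from base + row*p down to base + (row - 1)*p + 1.
        seq.extend(range(base + row * p, base + (row - 1) * p, -1))
    return seq


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []
    for x in seq:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)


def lds_length(seq):
    """Length of the longest strictly decreasing subsequence."""
    return lis_length([-x for x in seq])


sigma = median_sequence(1, 3)
print(sigma, lis_length(sigma), lds_length(sigma))
```

An increasing subsequence can take at most one element from each row, and a decreasing subsequence must stay inside a single row, which is why both lengths equal $p$.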


Given an instance $(s_A, s_B)$ of the SET-DISJOINTNESS problem where there are exactly $n/4$ ones 
in each of $s_A$ and $s_B$, we construct two streams as follows: 


lcs-$p$-approx-perm-A($s_A$) = $p$-approx-A($1, s_A$), $p$-approx-A($2, s_A$), ..., $p$-approx-A($n, s_A$) and 
lcs-$p$-approx-perm-B($s_B$) = $p$-approx-B($n, s_B$), $p$-approx-B($n-1, s_B$), ..., $p$-approx-B($1, s_B$). 


Returning to our example from Section 3 where n = 9, we have 


lcs-2-approx-perm-A($\mathrm{ex}_A$) = 4,3,2,1, 5,6,7,8, 12,11,10,9, 13,14,15,16, 17,18,19,20, 
24,23,22,21, 28,27,26,25, 32,31,30,29, 36,35,34,33. 

lcs-2-approx-perm-B($\mathrm{ex}_B$) = 34,33,36,35, 30,29,32,31, 25,26,27,28, 22,21,24,23, 
18,17,20,19, 13,14,15,16, 10,9,12,11, 6,5,8,7, 2,1,4,3. 


Lemma 4.4 If $s_A$ and $s_B$ intersect, then LCS(lcs-$p$-approx-perm-A($s_A$), lcs-$p$-approx-perm-B($s_B$)) 
has length at least $p^2$. If $s_A$ and $s_B$ do not intersect, then the length of the LCS is at most $p$. 


Proof. If $s_A$ and $s_B$ intersect, say with $s_A(i) = s_B(i) = 1$, then we see that $p$-approx-A($i, s_A$) = 
$p$-approx-B($i, s_B$). Hence the sequence $(i-1) \cdot p^2 + 1, \ldots, i \cdot p^2$ has length $p^2$ and is a subsequence of 
both lcs-$p$-approx-perm-A($s_A$) and lcs-$p$-approx-perm-B($s_B$). (In our example, 13,14,15,16 is such a 
subsequence.) 

On the other hand, suppose $s_A$ and $s_B$ do not intersect. Recall that lcs-$p$-approx-perm-A($s_A$) lists 
the blocks $p$-approx-A($\cdot, s_A$) in increasing order of index, while lcs-$p$-approx-perm-B($s_B$) lists the blocks 
$p$-approx-B($\cdot, s_B$) in decreasing order of index. Thus all the numbers in any common subsequence must 
come from the block of a single index $i$. Since $s_A$ and $s_B$ do not intersect, we know that for any index 
$i$ one of the following three cases holds: 


1. $s_A(i) = 1, s_B(i) = 0$. Then $p$-approx-A($i, s_A$) and $p$-approx-B($i, s_B$) have a longest common 
subsequence of length $p$, since one is an increasing sequence while the other is a median 
sequence. 


2. $s_A(i) = 0, s_B(i) = 1$. Then $p$-approx-A($i, s_A$) and $p$-approx-B($i, s_B$) have a longest common 
subsequence of length 1, since one is a decreasing sequence while the other is an increasing 
sequence. 


3. $s_A(i) = 0, s_B(i) = 0$. Then $p$-approx-A($i, s_A$) and $p$-approx-B($i, s_B$) have a longest common 
subsequence of length $p$, since one is a decreasing sequence and the other is a median sequence. 


Thus the LCS has length at most $p$ when $s_A$ and $s_B$ do not intersect. 
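
The full construction can be reproduced in a few lines, which also gives a check on the worked example with $n = 9$ and $p = 2$. This is a sketch under assumptions: the bit vectors `ex_a` and `ex_b` are inferred from the printed example streams (ones at positions 2, 4, 5 and at positions 4, 7 respectively), the function names are illustrative, and the LCS is computed offline.

```python
def lcs_length(x, y):
    """Textbook dynamic program for LCS length (offline, not streaming)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def increasing_block(i, p):
    """The p*p integers of block i in increasing order."""
    return list(range((i - 1) * p * p + 1, i * p * p + 1))


def median_block(i, p):
    """Block i as p decreasing rows, with the rows in increasing order."""
    base = (i - 1) * p * p
    seq = []
    for row in range(1, p + 1):
        seq.extend(range(base + row * p, base + (row - 1) * p, -1))
    return seq


def stream_a(s_a, p):
    """Blocks for i = 1..n: increasing if s_a(i) = 1, decreasing otherwise."""
    out = []
    for i, bit in enumerate(s_a, start=1):
        block = increasing_block(i, p)
        out.extend(block if bit == 1 else block[::-1])
    return out


def stream_b(s_b, p):
    """Blocks for i = n..1: increasing if s_b(i) = 1, median sequence otherwise."""
    out = []
    for i in range(len(s_b), 0, -1):
        out.extend(increasing_block(i, p) if s_b[i - 1] == 1 else median_block(i, p))
    return out


# Bit vectors inferred from the example streams (an assumption, see above).
ex_a = [0, 1, 0, 1, 1, 0, 0, 0, 0]
ex_b = [0, 0, 0, 1, 0, 0, 1, 0, 0]
a, b = stream_a(ex_a, 2), stream_b(ex_b, 2)
print(a)
print(b)
print(lcs_length(a, b))
```

Clearing the shared one at position 4 in `ex_b` makes the vectors disjoint, and the LCS length then drops from $p^2$ to $p$.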


Theorem 4.5 For any approximation ratio $p$, and for any $N$, any streaming algorithm which 
decides whether (i) $\mathrm{LCS}(S_1, S_2) \geq p^2$ or (ii) $\mathrm{LCS}(S_1, S_2) \leq p$ for streams $S_1, S_2$ which are permutations 
of $\{1,\ldots,N\}$ with probability at least 3/4 requires $\Omega(N/p^2)$ space. 


Proof. As in our previous lower-bound theorems, we can solve an instance of the SET-DISJOINTNESS 
problem with $|s_A| = N/p^2 = |s_B|$ as follows. By Lemma 4.4, deciding whether the constructed 
streams lcs-$p$-approx-perm-A($s_A$) and lcs-$p$-approx-perm-B($s_B$) have an LCS of length (i) at least 
$p^2$ or (ii) at most $p$ corresponds to deciding whether $s_A$ and $s_B$ intersect. So a data stream 
algorithm $A$ can be used to solve the SET-DISJOINTNESS problem. The first party simulates $A$ on 
lcs-$p$-approx-perm-A($s_A$), then passes the state of the algorithm to the second party. The second 
party finishes the simulation of $A$ on lcs-$p$-approx-perm-B($s_B$). Again, by Proposition 3.2, this 
implies that we need $\Omega(N/p^2)$ space for this LCS decision procedure. 


Corollary 4.6 To $p$-approximate the LCS of $N$-element permutations, we need $\Omega(N/p^2)$ space. 


5 Conclusion and Future Work 


A classic theorem of Erdős and Szekeres follows from an elegant application of the pigeonhole 
principle: for any sequence $S$ of $n + 1$ numbers, there is either an increasing subsequence of $S$ of 
length $\sqrt{n}$ or a decreasing subsequence of $S$ of length $\sqrt{n}$ [10]. One of our original motivations for 
looking at the LIS problem was to consider the difficulty of deciding, given a stream $S$, whether 
(1) the length of the LIS of $S$ is at least $\sqrt{|S|}$, (2) the length of the longest decreasing subsequence is 
at least $\sqrt{|S|}$, or (3) both. To do this, one needs an exact streaming algorithm for LIS; a minor 
modification to the median sequence in Section 4 shows that one can have an LIS of length $\sqrt{n}$ or 
length $\sqrt{n} - 1$ with a longest decreasing subsequence of length $\sqrt{n}$ or length $\sqrt{n} + 1$, respectively. 

Of course, in the streaming model one is usually interested in approximate algorithms using, 
say, polylogarithmic space. Our lower bounds for LCS show that one needs a large amount of 
space for any reasonable approximation. However, our lower bounds for the LIS problem say only that a 
streaming algorithm that distinguishes between an LIS of length $k$ and one of length $k + 1$ requires 
$\Omega(k)$ space. It is an interesting open question whether one can use a small amount of space to 
approximate LIS in the streaming model. 